(54) Methods and system for using web browser to search large collections of documents



(57) A system (100-114) for rapidly and easily
searching large collections of documents (106) using
standard web browser programs (100) as the user
interface. The present invention parses a collection of text
documents (106) to identify symbols therein and builds
a database file (108) which identifies the file and line
locations of each symbol identified. The database file
(108) is constructed to permit rapid searching for
symbols to permit interactive use of the present invention as
a search tool (200-312). A database client process
(102) interacts with the web browser (100) via standard
CGI techniques to convert browser commands and
queries into appropriate server process requests. A server
process (104) receives such requests and manipulates
the database file (108) in response to the requests.
Query results returned to the client process (102) are
then reformatted by the client process (102) to return a
document with hypertext links in place of search keys
located in the database (e.g., an HTML page). The
system of the present invention thereby provides for rapid
searching of large collections of text documents which
is not coupled to a specific toolset used to create any
one of the documents and which uses a simple and
well-known user interface, namely: web browsers.
Description

Background of the Invention 

1. Field of the Invention. 

[0001] The present invention relates to systems for browsing documents and in particular to methods and systems
for using a web browser to quickly search large collections of documents such as arbitrary text documents.

2. Discussion of Related Art. 

[0002] It is common to use a computer to assist a user in browsing through large collections of documents. For
example, patent attorneys and patent examiners frequently review large patent documents or collections of related patent or
legal documents. Or, for example, computer programmers frequently browse large files of computer source language
programs or collections of related source language programs. Computers are applied to assist in such situations to
improve, in particular, the speed of searching for symbols or keywords in the collection of documents. Manually
searching large collections of documents can be extremely cumbersome and unproductive.

[0003] Text editors or word processors on computer systems are known to allow such browsing by simple sequential
paging or scrolling through the documents or by search capabilities to locate particular words or phrases. However,
such known techniques typically do not use indexed searching techniques to locate desired search terms in the
document(s). Indexed searches are those which use an index to rapidly locate occurrences of a particular symbol or keyword
in the text. Rather, simple linear search techniques are most commonly utilized by known text editor or word processing
techniques. Such simple linear search techniques are impractical when scaled up to very large collections of
documents. Simple, non-indexed search techniques cannot provide adequate performance when used in very large
collections of documents.

[0004] For example, a team of programmers may need to rapidly search for related terms or phrases in the collection
of source code files which implement an operating system. One such operating system, by way of example, comprises
over 14,000 directories including 70,000 files totaling over 40,000,000 lines of source code. Simple, non-indexed search
techniques are inadequate for such large collections of files.

[0005] To aid in browsing applications for computer programmers, source code browser programs are often included
in program development environments (e.g., in computer aided software engineering (CASE) toolsets). Source code
browser programs are usually tightly coupled to the underlying program development package and therefore are only
operable in conjunction with the corresponding tools. However, source code browsers do not in general provide
browsing service for arbitrary text documents outside the context of the program development tools. Furthermore, they are
often constrained by the underlying databases which control the operation of the program development toolset. The
databases which contain design information regarding a software development "project" often cannot handle such large
collections of files as noted above. Lastly, different source code browser programs each provide a unique user interface,
potentially forcing a user to learn a proprietary user interface in order to scan collections of documents.
[0006] In a related aspect of browsing through documents, the Internet World-Wide Web (WWW) utilizes a web
browser program at the user's computer (a web client program) to access information provided at a web server site. The
protocols and standards which define WWW include hypertext links embedded within a document (also referred to
herein as links or hyperlinks) as defined by the Hypertext Markup Language (HTML) standards and as communicated
via the Hypertext Transfer Protocol (HTTP). A link is an object on a page of information which links to other related
information. In standard WWW web browser programs, the user can move to this related information by simply "clicking" the
link as it is displayed on the user's computer screen.

[0007] Links (or hyperlinks) are also known outside the context of HTML web browsing programs. For example, "help"
files as commonly provided in operating systems and applications such as Microsoft Windows or Microsoft Office tools
are often designed with hyperlinks to permit the user to thereby navigate among related help messages and topics.
Further, web browsers are known to understand protocols other than HTML and to use hyperlinks therewith. For example,
most web browsers also support the file transfer protocol (FTP) wherein file system directories may be viewed as a tree
structure and the files and subdirectories therein displayed by the web browser as hyperlinks.
[0008] Web browser programs, per se, provide no indexed searching capability for the information presently displayed
on the user's computer display or related information referenced by links in the present display. Rather, as for text and
word processors noted above, the web browser programs, per se, offer mere sequential search of information presently
displayed on the user's computer screen.

[0009] Associated with the WWW are a number of web server sites functioning as "search engines" which provide
access to indexed information to locate web pages that are of interest to a user. In general, these search engines
search large, proprietary databases for matches against a set of user supplied keywords. A list of web pages which
match the user's supplied keyword search is then returned to the user's web browser. The list of matching web pages
is presented by the web browser program on the user's computer display as a list of links to the matching web pages.
The user may then select one of interest and click the link to visit that web page.

[0010] Standard features of such web browser programs allow simple "navigation" on the web. For example, standard
features include the ability to move forward or backward over a chain of linked web pages. A first web page visited may
provide a link to another page of interest and so on. Multiple such links may be thought of as a chain. Once having
navigated to one page in such a chain of linked pages, the web browser provides standard features to navigate forward or
backward on the chain of links already visited.

[0011] Present web search engines provide an initial list of web pages that may be of interest to the user in
accordance with the keyword search terms provided. Once the user is viewing a particular web page so located, the
information on the page is merely displayed as originally designed by the information provider of that web page. In other words,
there is no capability provided by the web search engine to provide further searching within the particular web page
being viewed. As noted above, the web browser program (the web client program) may provide simple linear search
capability for text viewed on the web page. However, also as noted above, such simple linear searching of a large
collection of documents can be quite inefficient. No efficient, indexed search capability is provided by present search
engines or present browser programs to rapidly locate arbitrary text in a large document or collection of documents.
[0012] It can be seen from the above discussion that a need exists for a text search capability that is efficient at 
searches of large collections of documents and is easy to use providing a simple, standardized user interface. 

Summary of the Invention

[0013] The present invention solves the above and other problems, thereby advancing the state of the useful arts, by
providing a system and associated methods for using web browser programs to efficiently search large collections of
documents. More specifically, the present invention enhances a web browser so as to enable a user to quickly find
pertinent information in a set of documents so large it cannot be printed, read, or even linearly searched interactively by a
user with or without a computer. Still more specifically, the present invention dynamically builds a database of search
keys or symbols found in a collection of text documents. Other aspects of the present invention, integrated within the
web browser program, search the database to locate a desired symbol, keyword, or file in the collection of text
documents and display the search results on the web browser display. The search results are converted to a page having
hyperlinks (e.g., HTML hyperlinks) for each search key (symbol or filename) found in the collection of text documents.
The converted page is then displayed on the computer screen by the web browser.

[0014] The building of the database is performed by a database builder process of the present invention. The builder
process parses the collection of text documents to identify symbols (also referred to herein as tokens or keywords)
within the collection of text documents. The parser may be generalized so as to enable useful parsing of a wide variety
of textual document formats. Preferably, a plurality of parsing components are associated with the builder process. Each
parser component of the builder process is optimized for parsing of a particular type of document. The database file of
search keys is built from the parsed documents. The documents may be parsed to a tokenized level that encompasses
literally every word of a text document (or less than every word if desired by the user).

[0015] The present invention permits a large collection of documents, including very large documents, to be viewed
as a single related project of information. The hypertext version of any document (or search results) as displayed by the
web browser allows quick access to related text without the need for the user to construct new search terms and use a
costly linear search for each new concept to be searched. Furthermore, the present invention utilizes a well known user
interface regardless of the type of document(s) being searched. The familiar user interface of web browser programs is
utilized by the present invention to search large collections of documents. In addition, the present invention permits the
same rapid, easy to use searching of documents regardless of the original source language and nature of the
document. In other words, any text documents may be searched as a single project. The documents may be a
heterogeneous mix of simple text documents, program source code text in any of several well known programming languages, etc.
So long as the document may be parsed as a text file (or converted to a text file for parsing), the present invention allows
the user to search the documents as a single related set of documents. Code browsers, as presently known, generally
permit only searching of related documents (i.e., all documents that are in a common programming language and are
part of a "project" as defined within the software development toolset). Other descriptive text documents cannot be
easily integrated with the source code documents to permit broader searching by a code browser of concepts in related
text.

[0016] The database file uses a hash table data structure that permits extremely rapid searching to locate symbols.
The record corresponding to the symbol so located then provides a list of file entries each corresponding to one of a
subset of files from the source files provided in which the corresponding symbol is found. Linked to each file entry is a
list of line entries each representing a line number where the located symbol is found in the corresponding file. This
hash structure permits rapid searching of the collection of documents for a specified symbol. A preferred physical
embodiment of this hash table structure optimizes the structure both to reduce memory space requirements and to
reorder the stored information in the database file to improve access performance thereto.
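
The logical relationship described above (symbol record, then file entries, then line entries) can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the database file format of the invention: the names `new_index`, `add_occurrence`, and `lookup` are hypothetical, and Python's built-in dictionaries stand in for the hash table.

```python
from collections import defaultdict


def new_index():
    """Create an empty index: symbol -> {file path -> list of line numbers}.

    Python dictionaries are hash tables, so locating a symbol record does not
    require scanning the indexed documents. (Hypothetical model only.)
    """
    return defaultdict(lambda: defaultdict(list))


def add_occurrence(index, symbol, path, line_no):
    """Record that `symbol` appears in file `path` at line `line_no`."""
    lines = index[symbol][path]
    if line_no not in lines:  # one line entry per (symbol, file, line)
        lines.append(line_no)


def lookup(index, symbol):
    """Return {file path: sorted line numbers} for `symbol`, or {} if absent."""
    files = index.get(symbol, {})
    return {path: sorted(lines) for path, lines in files.items()}


if __name__ == "__main__":
    idx = new_index()
    add_occurrence(idx, "open_file", "io.c", 42)
    add_occurrence(idx, "open_file", "io.c", 7)
    add_occurrence(idx, "open_file", "main.c", 3)
    print(lookup(idx, "open_file"))
```

A query for a symbol touches only that symbol's record, so its cost depends on the number of hits rather than on the total size of the collection.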

[0017] The database builder can be provided a list of tokens to be ignored in the search keys. For example, common
connector words or keywords can be ignored since they appear so often. English connectors such as: "the", "and", etc.
or programming language keywords such as "int", "char", etc. may be eliminated from the database by instructing the
database builder to ignore them.
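
As a minimal sketch of such an ignore list, the following Python function drops ignored tokens before they would reach the database. The function name `filter_tokens` and the case-sensitive comparison are assumptions made for illustration; the text does not specify how the builder compares tokens.

```python
def filter_tokens(tokens, ignored):
    """Return the tokens that are not on the ignore list, in original order.

    Comparison is case-sensitive here, which is an assumption of this sketch.
    """
    ignored_set = frozenset(ignored)  # hashed set gives constant-time membership tests
    return [token for token in tokens if token not in ignored_set]


if __name__ == "__main__":
    tokens = ["int", "count", "the", "char", "buffer"]
    print(filter_tokens(tokens, ["the", "and", "int", "char"]))
```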

[0018] The present invention preferably accesses the database using a client/server model wherein a single server
process coordinates shared access to one or more databases by multiple client processes. Such a database server
process runs continuously on a computing node to provide shared access to databases built by the database builder
module. A database client process provides a link between standard web browser programs and the database server
process. As known to those skilled in the art, the database client process is actually coupled to a web server process
which directly invokes the client process. The processing performed by the web server process in this situation is
standard and so trivial as to be essentially ignored herein.

[0019] Using well known web browser integration techniques such as common gateway interface (CGI), the database
client receives browser queries for a collection of documents. The client process then converts the browser commands
and queries into appropriate server commands for processing by the server process. The server commands are then
transmitted to the server process and query results (or command status) received in return. The results returned from
the server process are "tokenized" in that each symbol or keyword that appears in the results is delimited as a token.
The client process recognizes the tokens and converts them to a format appropriate to the browser program. Preferably,
the results are reformatted into formats compatible with web browsers having hyperlinks corresponding to each
potential symbol or keyword found in the search results. This converted format is then displayed by the standard web browser
with hyperlinks for each search key to permit simple linking to related lines and files in the text documents.
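
The conversion of tokens into hyperlinks can be illustrated with a short Python sketch. Several details here are assumptions rather than part of the description: tokens are recognized with a simple identifier pattern instead of the server's own delimiters, and the link target `/cgi-bin/dbclient` and its `symbol` parameter are hypothetical names.

```python
import html
import re
from urllib.parse import quote

# Assumed token shape: a C-style identifier. The server's real token
# delimiters are not reproduced in this sketch.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def linkify(line, base_url="/cgi-bin/dbclient"):
    """Return `line` as HTML with each token wrapped in a hyperlink.

    Text between tokens is HTML-escaped so that characters such as '<'
    in source code display correctly in the browser.
    """
    parts = []
    pos = 0
    for match in TOKEN_RE.finditer(line):
        parts.append(html.escape(line[pos:match.start()]))
        symbol = match.group(0)
        parts.append(
            '<a href="{}?symbol={}">{}</a>'.format(
                base_url, quote(symbol), html.escape(symbol)
            )
        )
        pos = match.end()
    parts.append(html.escape(line[pos:]))
    return "".join(parts)


if __name__ == "__main__":
    print(linkify("if (count < limit)"))
```

Clicking any generated link would issue a new query for that symbol, which is what lets the user move through related lines and files without typing new search terms.
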
[0020] The present invention is preferably implemented using well known network programming techniques for the
client/server model interprocess communications. In particular, multiple web browsers can connect to a single database
client process, multiple database client processes may connect to a single database server process, and a single
database server process can service requests for multiple databases. Furthermore, the several components: web browsers,
database client(s), and database server(s) may be distributed over any number of interconnected computers or on a single
computer.

[0021] Additional advantages of the invention will be set forth in part in the description which follows, and in part will
be obvious from the description, or may be learned by practice of the invention. The advantages of the invention may
be realized and attained by means of instrumentalities and combinations particularly pointed out in the appended
claims and depicted in the figures as follows.

Brief Description of the Drawings 

[0022]

Figure 1 is a block diagram of the elements of the present invention.

Figure 2 is a block diagram providing a logical description of the structure of the database file of the present invention of
figure 1.

Figure 3 is a block diagram providing the preferred physical embodiment of the structure of the database file of
figure 1 and as logically depicted in figure 2.

Figure 4 is an exemplary computer display screen showing the results of a symbol in files query operation by the
present invention of figure 1.

Figure 5 is an exemplary computer display screen showing the results of another symbol in files query operation by
the present invention of figure 1.

Figure 6 is an exemplary computer display screen showing the results of a substring in symbols query operation by
the present invention of figure 1.

Figure 7 is an exemplary computer display screen showing the results of a substring in paths query operation by
the present invention of figure 1.

Figure 8 is a flowchart describing operation of the database builder in accordance with the present invention of
figure 1.

Figure 9 is a flowchart describing operation of a web browser in accordance with the present invention of figure 1.

Figure 10 is a flowchart describing the operation of a database client process in accordance with the present
invention of figure 1.

Figure 11 is a flowchart describing the operation of a database server process in accordance with the present
invention of figure 1.
Detailed Description of the Preferred Embodiments

[0023] While the invention is susceptible to various modifications and alternative forms, a specific embodiment thereof
has been shown by way of example in the drawings and will herein be described in detail. It should be understood,
however, that it is not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to
cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the
appended claims.

ARCHITECTURAL OVERVIEW 

[0024] Figure 1 is a block diagram of the present invention in which a web browser 100 searches for symbols in a
collection of text documents 106 by using the database file 108. As shown in figure 1, web browser 100 locates
information using database file 108 by requesting information from database client 102 via path 150. Results of such
requests are returned from database client process 102 to web browser 100 via path 160.

[0025] Database client process 102, in response to receipt of a request from web browser 100, passes database
request information to database server process 104 via path 152 for actual processing utilizing database file 108. In like
manner, database server process 104 returns results of the requests to database client process 102 via path 158.
Database server process 104 performs the requested operations on database file 108, retrieving requested information via
path 154. In addition, database server process 104 retrieves actual text from the collection of text documents 106 via
path 156.

[0026] Those skilled in the art will readily recognize that figure 1 is intended merely as a schematic diagram of one
exemplary embodiment of the present invention. In particular, web browser 100, database client process 102, and
database server process 104 may all be processes resident and operable within a single computing system, or may be
distributed over a plurality of computing systems and communicate using well-known inter-process, network
communication techniques. Furthermore, database file 108 and collection of text documents 106 may reside physically
on storage devices locally accessed by database server process 104, or may themselves reside on remote computing
nodes accessible via paths 154 and 156, respectively.

[0027] Database client 102 is shown in figure 1 as including a "thin" web server layer. Those skilled in the art will
recognize that the web browser 100 is a client program that is served by a web server process. As described herein, the
web server process is a "thin" layer in the sense that it performs little processing of interest with respect to the present
invention. To be precise, the web server process provides the actual interface to the database client process on behalf
of the web browser process. Further, as is known in the art, a single web server process may be in communication with
multiple web browser processes. The web server process may therefore invoke multiple database client processes on
behalf of multiple web browser requests. The web server is said to be multi-threaded in this sense. For all practical
purposes in describing the present invention, the web server layer may be ignored in favor of a description which focuses
on the logical connection of the web browser 100 to the database client process 102.

[0028] In addition, those skilled in the art will recognize that the client/server model depicted in figure 1 as database
client process 102 and database server process 104 is but one exemplary preferred embodiment of the present
invention. More generally, web browser 100 searches for symbols within the collection of text documents 106 by utilizing
indexing information stored in database file 108. The client/server model depicted in figure 1 localizes and modularizes
various functions required to permit such database access by multiple users and web browsers. In particular, as noted
above, a plurality of web browsers 100 may communicate with a single shared web server process (a layer essentially
ignored for purposes of describing the present invention). A single web server may invoke a plurality of database client
processes on behalf of multiple web browsers due to the multi-threaded nature of the web server process. Further, a
plurality of database client processes 102 may communicate with a single shared database server process 104.
Multiple database server processes 104 may share simultaneous access to database file 108 and collection of text
documents 106. Coordination of such shared communication paths is well-known in the art using standard client/server
inter-process communication techniques.

[0029] In operation, a user, utilizing the well-known graphical user interface of web browser 100, fills in blanks on a
display screen to compose a query for locating desired symbols, keywords, or files in the collection of text documents
106. The user's query is preferably a simple list of one or more symbols or keywords together with related control
parameters to be located in the collection of text documents 106. The query so constructed is communicated to the
database client process 102 utilizing well-known common gateway interface (CGI) techniques. The CGI specification is
a well-known standard maintained by the National Center for Supercomputing Applications (NCSA) in
Urbana-Champaign, IL, at the University of Illinois.
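
As a rough illustration of how a CGI program might receive such a query, the sketch below decodes the `QUERY_STRING` environment variable that a web server sets for a CGI GET request. The parameter names `symbol` and `max`, the default limit of 50, and the function names are assumptions made for this example; the description does not give the form field names.

```python
import os
from urllib.parse import parse_qs


def parse_query(query_string):
    """Split a CGI query string into (symbols, max_hits).

    `symbol` may repeat to list several search keys; `max` is a hypothetical
    control parameter limiting the number of hits returned.
    """
    params = parse_qs(query_string)
    symbols = params.get("symbol", [])
    max_hits = int(params.get("max", ["50"])[0])
    return symbols, max_hits


def read_cgi_query(environ=os.environ):
    """Read the query the web server passed to this CGI process."""
    return parse_query(environ.get("QUERY_STRING", ""))


if __name__ == "__main__":
    print(parse_query("symbol=open_file&symbol=close_file&max=10"))
```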

[0030] Query information is communicated from web browser 100 to database client process 102 via path 150 utilizing
these CGI techniques. In like manner, database client process 102 returns results of the query information to web
browser 100 via path 160 using the CGI standards. Database client process 102 re-formats the query information into
an internal request format defined by database server process 104. The transformed query information is then
communicated from database client process 102 to database server process 104 via path 152 using well-known client/server
inter-process communication techniques. A more detailed description of a preferred embodiment of such an internal
format is provided herein below.

[0031] Database server process 104 performs the requested query, accessing database file 108 and collection of text
documents 106 via paths 154 and 156, respectively, to retrieve the requested information. The retrieved information is
transmitted back to database client process 102 using an internally defined format (described herein below) via path
158.

[0032] Database client process 102 then performs necessary re-formatting of the returned information and then
forwards the information on to web browser 100 as returned data via path 160. Specifically, database client process 102
transforms the internally defined response format of database server process 104 into an HTML representation thereof
(or other format having hyperlinks defined therein for each symbol or keyword). Symbols or other keywords requested
by the original query are presented as hypertext links in the returned HTML query results. Details of the specific format
are dependent upon the particular query being performed as discussed herein below.

DATABASE FILE BUILDER PROCESS OPERATION 

[0033] Database builder process 110 operates in conjunction with one or more parsers 112 and 114 to construct
database file 108 by parsing the collection of documents 106. As shown in figure 1, database builder process 110 may be
initially invoked as an offline procedure, or as a request initiated by web browser 100 directly or indirectly via database
client process 102. Those skilled in the art will recognize the equivalence of many techniques for invoking database
builder process 110. For example, database builder process 110 may be automatically invoked by additional
procedures (not shown) which recognize changes made in collection of documents 106.

[0034] As shown in figure 1, database builder process 110 is operable with one or more parsers 112 and 114. Each
such parser 112 or 114 may be adapted for optimally parsing a particular type of source document language. For
example, a first parser may be optimally adapted for parsing a particular computer source programming language while
another parser may be particularly well-suited to parsing of legal documents. When invoked to create a database file
108 from collection of text documents 106, database builder process 110 may be instructed as to the preferred
parser to be used with each document in the collection of text documents 106. Alternatively, as will be apparent to those
skilled in the art, database builder process 110 and/or parsers 112 and 114 may automatically detect the type of a
particular source document and associate a preferred parser therewith.
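
One simple way to associate a parser with a document type is a lookup keyed on the file name extension, falling back to a general-purpose parser. The sketch below assumes that approach purely for illustration; the parser functions, the `PARSERS` table, and the token patterns are all hypothetical and much simpler than a real language parser.

```python
import os
import re

C_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
WORD_RE = re.compile(r"[A-Za-z]+")


def _scan(text, pattern):
    """Yield (symbol, line_number) pairs for every pattern match in `text`."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            yield match.group(0), line_no


def parse_c_source(text):
    """Hypothetical parser for C source: treats identifiers as symbols."""
    return _scan(text, C_IDENTIFIER_RE)


def parse_plain_text(text):
    """Hypothetical general-purpose parser: treats alphabetic words as symbols."""
    return _scan(text, WORD_RE)


# Assumed mapping from file extension to parser.
PARSERS = {".c": parse_c_source, ".h": parse_c_source}


def select_parser(path):
    """Pick a parser from the file extension, defaulting to plain text."""
    extension = os.path.splitext(path)[1].lower()
    return PARSERS.get(extension, parse_plain_text)


if __name__ == "__main__":
    parser = select_parser("driver.c")
    print(list(parser("int buf_len = 0;")))
```

Detection could equally be based on file contents or on explicit instructions from the caller, as the paragraph above allows.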

[0035] Figure 8 is a flowchart describing the operation of database builder process 110. Element 800 is first operable
to obtain a list of source documents from the caller or user of database builder process 110. As noted above, database
builder process 110 may be invoked directly by an interactive user or may be invoked by operations of web browser 100
and/or database client process 102. Element 800 is therefore operable to obtain the list of source documents from the
calling process or directly from an interactive user. The list of source documents provides a path name for each
document whose symbols are to be indexed in the resultant database file. Element 802 is next operable to obtain the list of
excluded symbols from the caller or user of database builder process 110. It is common in several types of source
documents to encounter frequently used symbols or keywords that are unlikely to be of significant value when searching
the collection of text documents. For example, in a typical English document, connector words such as "the", "or", "and",
etc. would substantially increase the size of the database file to be created without adding significant value to the
search user. Or, for example, C language source code documents may contain frequent instances of keywords such as
"int", "for", "while", etc. Element 802 is therefore operable to obtain a list from the user or calling process of words or
symbols to be excluded from the indexing performed by database builder process 110.

[0036] Element 804 is next operable to initialize processing by the database builder process for the first file in the
collection of text documents provided by the calling process or user. Element 806 next adds the file presently being
processed to the list of files (list of path names) maintained by the database builder process. The database builder process
may use well-known computer memory data structures for building the various data constructs of the database file. As will
be seen below, a final step in the database builder process converts such internal memory based data structures into
compressed data structures preferred for the physical embodiment of the database file (as discussed herein below).
[0037] Element 808 is next operable to select an appropriate parser for the file presently being processed. As noted
above, the database builder process may be associated with one or more parsing processes. Each parsing process
may be optimized for parsing a particular form or type of source document. Element 808 is therefore operable to
determine the type of file presently being processed and to select a preferred parser in association therewith. Those skilled
in the art will recognize that a parser process may be generalized so as to be capable of processing any of several
source file types. It will therefore be recognized that database builder process may be associated with but a single
parser process as well as with a plurality of parser processes customized or optimized for particular types of
documents.
[0038] Element 810 is next operable to invoke the selected parser to retrieve the next symbol in the present source
file. This invocation of the parser retrieves both the symbol and line number of the next symbol found in the source file.
Element 812 is next operable to determine if the retrieved symbol is already present in the symbol list being constructed
(as below) by the database builder process. If the retrieved symbol is already present in the symbol list associated with
the present file, processing continues to element 818. Otherwise, element 814 is next operable to determine if the
retrieved symbol is present in the list of excluded symbols provided by the user or calling process. If element 814
determines that the retrieved symbol is to be excluded, the symbol is ignored and processing continues by looping back to
element 810. Otherwise, element 816 is next operable to add the retrieved symbol to the symbol list being constructed
for the current file. Processing then continues with element 818.

[0039] Element 818 is next operable to determine whether the symbol just retrieved at a particular line number in the
present file is already known to be present at that line number in the line number list being constructed (as below) for
that symbol of the present file. If the symbol is already known to be present at the identified line of the current file,
processing continues with element 822. If the symbol is not known to be found at the identified line number in the
present file, element 820 is next operable to add a line number entry to the line number list for the corresponding symbol
in the present file. Element 822 is next operable to determine if the entire file has been processed, that is, if all symbols in
the present file have been processed as above. If further symbols remain to be processed, processing continues by
looping back to element 810. Otherwise processing continues at element 824.

[0040] Element 824 is operable to determine whether additional files in the collection of text documents remain to be
parsed and processed as described above. If no additional files remain to be processed, processing continues at
element 828 to generate the persistent storage physical format of the database file as described below with respect to
figure 3. The above identified operations of database builder process 110 preferably construct in-memory computer data
structures as well known to those skilled in the art. The in-memory data structures are traversed by element 828 in a
sequence to produce the optimal physical storage structure described herein below. If element 824 determines that
further files require processing, element 826 is next operable to begin processing the next source document in the
collection of text documents. Processing then continues by looping back to element 806 to continue processing with a
new source file.
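
The loop of elements 806 through 826 may be sketched in C++ as follows. This is only an illustration under stated assumptions, not the builder of the preferred embodiment: the names SymbolIndex, NextSymbol, and BuildIndex are hypothetical, the "parser" merely splits out identifier-like words, and sorted standard containers stand in for the in-memory data structures.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// symbol -> file path -> line numbers (hypothetical in-memory form)
using SymbolIndex = std::map<std::string, std::map<std::string, std::set<int>>>;

// Hypothetical stand-in for a parser: returns the next identifier-like
// word in `line` starting at `pos`, or an empty string when none remain.
static std::string NextSymbol(const std::string& line, std::size_t& pos) {
    while (pos < line.size() &&
           !(std::isalpha(static_cast<unsigned char>(line[pos])) || line[pos] == '_'))
        ++pos;
    std::size_t start = pos;
    while (pos < line.size() &&
           (std::isalnum(static_cast<unsigned char>(line[pos])) || line[pos] == '_'))
        ++pos;
    return line.substr(start, pos - start);
}

// Index a collection of (path, contents) documents, skipping excluded symbols.
SymbolIndex BuildIndex(const std::vector<std::pair<std::string, std::string>>& docs,
                       const std::set<std::string>& excluded) {
    SymbolIndex index;
    for (const auto& doc : docs) {                    // elements 806, 824, 826
        std::istringstream in(doc.second);
        std::string line;
        int lineNo = 0;
        while (std::getline(in, line)) {
            ++lineNo;
            std::size_t pos = 0;
            for (std::string sym = NextSymbol(line, pos); !sym.empty();
                 sym = NextSymbol(line, pos)) {       // element 810
                if (excluded.count(sym)) continue;    // element 814
                // Elements 812-820: the map and set insert only when the
                // symbol, file, or line number is not already recorded.
                index[sym][doc.first].insert(lineNo);
            }
        }
    }
    return index;
}
```

Because an excluded symbol is never added, testing for exclusion before testing for presence gives the same result as the order described for elements 812 and 814.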

DATABASE FILE STRUCTURE 

[0041] Figures 2 and 3 depict the database file data structures in two forms. First, figure 2 describes the database
file data structure in logical terms, as logically understood and manipulated by database server process 104 to
perform requested queries. Figure 3, by way of contrast, is the preferred physical embodiment of the database file
logically depicted in figure 2. In particular, the preferred embodiment depicted in figure 3 stores the database file
in such a manner as to reduce its total size and to improve performance in accessing the database file. More
specifically, the preferred physical embodiment of database file 108 uses a compressed encoding of integers (as
discussed herein below) to reduce storage requirements for the database file and to improve overall access performance
to the database file. Furthermore, the preferred physical embodiment of the database file as depicted in figure 3
improves access performance thereto by storing the data in a sequence which more closely matches the locality of
references typified by database server process 104. In other words, portions of the database file which are likely to
be accessed chronologically near one another are stored physically near one another to reduce access times required
for reading the database file.

[0042] In particular, figure 2 describes a hash table data structure comprising at its highest layer an array of hash
table bucket pointers 202. As is well-known in the art, the number of hash table bucket pointers 202 is typically less
than the number of symbols or key values which are "hashed" into indices of the hash table. Each hash table bucket
pointer 202 therefore is the head pointer to a list of symbol entries 204, all of which hash to the same hash value.
Each symbol entry 204 includes identification of the symbol (or keyword, also referred to herein as token) as well as
a pointer to a list of file entries 206. Each entry in the list of file entries 206 represents a file (text document)
in the collection of text documents (also referred to herein as source documents) in which the corresponding symbol
(token) is found. Each file entry 206 includes identification of the file path represented thereby as well as a
pointer to a list of line number entries 208. Each line number entry 208 includes identification of a line number
within the corresponding file at which the corresponding symbol is located.
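
The logical structure just described can be pictured with the following C++ sketch. The type and member names (FileEntry, SymbolEntry, SymbolTable, Add, Find) are hypothetical, standard library lists replace explicit pointers, and std::hash is an assumed hash function; the bucket count must be greater than zero.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <vector>

struct FileEntry {                 // file entry 206
    std::string path;              // path of the text document
    std::list<int> lines;          // line number entries 208
};

struct SymbolEntry {               // symbol entry 204
    std::string symbol;            // the symbol (token) itself
    std::list<FileEntry> files;    // files in which the symbol occurs
};

class SymbolTable {
public:
    explicit SymbolTable(std::size_t bucketCount) : buckets_(bucketCount) {}

    // Record one occurrence of `symbol` at `line` of `path`.
    void Add(const std::string& symbol, const std::string& path, int line) {
        std::list<SymbolEntry>& chain = buckets_[Index(symbol)];
        SymbolEntry* entry = nullptr;
        for (SymbolEntry& e : chain)
            if (e.symbol == symbol) { entry = &e; break; }
        if (!entry) { chain.push_back({symbol, {}}); entry = &chain.back(); }

        FileEntry* file = nullptr;
        for (FileEntry& f : entry->files)
            if (f.path == path) { file = &f; break; }
        if (!file) { entry->files.push_back({path, {}}); file = &entry->files.back(); }
        file->lines.push_back(line);
    }

    // Return the entry for `symbol`, or nullptr if it was never added.
    const SymbolEntry* Find(const std::string& symbol) const {
        for (const SymbolEntry& e : buckets_[Index(symbol)])
            if (e.symbol == symbol) return &e;
        return nullptr;
    }

private:
    std::size_t Index(const std::string& s) const {
        return std::hash<std::string>{}(s) % buckets_.size();
    }
    std::vector<std::list<SymbolEntry>> buckets_;  // bucket pointers 202
};
```

A lookup walks one chain only, so its cost depends on how many symbols share a bucket rather than on the total number of symbols.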

[0043] The data structure of figure 2 is exemplary of one preferred logical embodiment of the database file 108
structure. Those skilled in the art will recognize many equivalent data structures to provide indexed search
capabilities for the symbols found in the collection of text documents.

[0044] Access time to database file 108 is a critical factor in the overall interactive performance of the present
invention. For this reason, the logical structure described in figure 2 may be inadequate to provide the requisite
interactive performance. Specifically, the database file size may be quite large depending on the number of symbols
found in the collection of text documents. In addition, the logical structure depicted in figure 2 does not inherently
provide locality of access in the data structure. That is to say, a first access to the database and a second access
to the database for related items may require randomly scattered access to differing portions of the database file.

[0045] Figure 3 is a schematic representation of the preferred physical embodiment of database file 108 of the present
invention. As noted above, the preferred physical embodiment of database file 108 shown in figure 3 utilizes
compressed encoding of integer values (described herein below) to reduce the memory requirements for storage of
database file 108. In addition, the organization of entries in the physical embodiment of database file 108 as
depicted in figure 3 improves access to database file 108 by improving locality of references.

[0046] The preferred database file physical structure includes all requisite data in a preferred order as follows.
First, the number of path names comprising the list of text documents is encoded as an integer value in #paths 300.
Next, the actual string data of the path names of the collection of text documents is encoded in paths 302. The
strings representing the file (path) names of the text documents used to construct the database are concatenated
(preferably with an intervening separator character) in the order in which they were processed by the database builder
process. Next, the symbols and associated file and line numbers are stored in a concatenated form for each element of
the array of hash table bucket pointers (stored later in the physical format). Specifically, each bucket list entry
304 includes the symbols (as concatenated strings) in that bucket followed by the list of files and corresponding line
numbers within those files where the symbols are located. Symbol and file/line index offsets 306 provide pointers into
the bucket list entries 304 for each distinct symbol in the list of symbols for a particular bucket list entry 304.
Next, hash table chain offsets 308 provide offsets into the symbol and file/line index offsets 306 indicating the
offset of the first symbol in the symbol list associated with the hash bucket pointer. Table offset 310 provides a
pointer to the first of the hash table chain offsets 308. Lastly, table size 312 provides the entire size of the hash
table which in turn provides the starting position of symbol and file/line index offsets 306.
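
As a small illustration of the paths 302 region only, the following C++ sketch concatenates path names and recovers one by its position. It rests on assumptions not stated in the text: the separator is a newline, file numbers count from 1 (zero is used as a list terminator in the lookup pseudo-code presented later), and the helper names JoinPaths and PathAt are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

const char kPathSeparator = '\n';   // assumed separator; the text does not name one

// Concatenate path names in processing order, one separator after each.
std::string JoinPaths(const std::vector<std::string>& paths) {
    std::string block;
    for (const std::string& p : paths) {
        block += p;
        block += kPathSeparator;
    }
    return block;
}

// Return the path with 1-based number `fileNum`, or "" if out of range.
std::string PathAt(const std::string& block, std::size_t fileNum) {
    std::size_t start = 0;
    for (std::size_t n = 1; start < block.size(); ++n) {
        std::size_t end = block.find(kPathSeparator, start);
        if (end == std::string::npos) end = block.size();
        if (n == fileNum) return block.substr(start, end - start);
        start = end + 1;
    }
    return "";
}
```

Storing each path once and referring to it by number keeps the per-symbol file lists small, since each file entry then needs only an integer.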

INTEGER COMPRESSION 

[0047] All integral offsets and pointer values described above with respect to the preferred physical embodiment of
database file 108 as shown in figure 3 are compressed to reduce the amount of storage required in the preferred
physical embodiment of the database file. It is noted that pointer and offset values in such a logical data structure
as depicted above with respect to figure 2 often include a significant number of leading zeros. The integer
compression used in conjunction with the present invention reduces the number of leading zeros required to be stored
for such pointers and offsets.

[0048] In particular, integer values in the range of 0 through 127 are encoded in a single byte whose most significant
bit is 1. Integer values in the range 128 through 16,383 are encoded in two bytes whose most significant two bits are
01. Integer values in the range 16,384 through 2,097,151 are encoded in three bytes whose most significant three bits
are 001. Lastly, integer values in the range 2,097,152 through 536,870,911 are encoded in four bytes whose most
significant three bits are 000. Those skilled in the art will recognize that the encoding technique described above
may be easily extended to representations of larger numbers using more than four bytes.

[0049] Methods of the present invention decode the compressed integer values as information is accessed from the
compressed database file. Those skilled in the art will readily recognize the decompression technique as the mirror
image of the above compression description. The compression and decompression techniques may be further understood
with reference to the following C++ language code listings.

// OutFile (the output stream) and CurOffset (the running byte offset)
// are defined elsewhere in the database builder.
void
WriteCompInt(unsigned int v)    // compression encoding function
{
    // We cannot deal with numbers so large that their upper bits
    // become tags for numbers presumed smaller.
    //
    if ((v & 0xe0000000) != 0) {
        printf("Numbers are too large to create a database.\n"
               "Try building a smaller database.\n");
        exit(1);
    }

    if ((v & ~0x7f) == 0) {
        // Number can fit in one byte with the top bits flagged as 1
        putc(v | 0x80, OutFile);
        ++CurOffset;
    }
    else if ((v & ~0x3fff) == 0) {
        // Number can fit in two bytes with the top bits flagged as 01
        putc((v >> 8) | 0x40, OutFile);
        putc(v & 0xff, OutFile);
        CurOffset += 2;
    }
    else if ((v & ~0x1fffff) == 0) {
        // Number can fit in three bytes with the top bits flagged as 001
        putc((v >> 16) | 0x20, OutFile);
        putc((v >> 8) & 0xff, OutFile);
        putc(v & 0xff, OutFile);
        CurOffset += 3;
    }
    else {
        // Number can fit in four bytes with the top bits flagged as 000
        putc((v >> 24) & 0x1f, OutFile);
        putc((v >> 16) & 0xff, OutFile);
        putc((v >> 8) & 0xff, OutFile);
        putc(v & 0xff, OutFile);
        CurOffset += 4;
    }
    return;
}

int
TScanDB::ReadCompInt(           // compression decoding function
    unsigned char **FilePtr
)
//
// Read and decompress an integer from the current file pointer location
// in the memory mapped data base.
//
{
    if (**FilePtr & 0x80) {
        // byte starts with 1, numbers up to 2^7 - 1 = 127
        *FilePtr += 1;
        return ((*FilePtr)[-1] & 0x7f);
    }

    if (**FilePtr & 0x40) {
        // byte starts with 01, numbers up to 2^14 - 1 = 16,383
        *FilePtr += 2;
        return (((*FilePtr)[-2] & 0x3f) << 8) | (*FilePtr)[-1];
    }

    if (**FilePtr & 0x20) {
        // byte starts with 001, numbers up to 2^21 - 1 = 2,097,151
        *FilePtr += 3;
        return (((*FilePtr)[-3] & 0x1f) << 16) |
               ((*FilePtr)[-2] << 8) | (*FilePtr)[-1];
    }

    // byte starts with 000, numbers up to 2^29 - 1 = 536,870,911
    *FilePtr += 4;
    return ((*FilePtr)[-4] << 24) | ((*FilePtr)[-3] << 16) |
           ((*FilePtr)[-2] << 8) | ((*FilePtr)[-1]);
}



[0050] The above integer compression encoding technique provides compression ratios of the database file any- 
where between 1X and approximately 4X. 

[0051] Still further integer compression may be achieved with a second compression technique applied in conjunction
with the above. As discussed above with respect to figure 3, sequences of integer offset values are concatenated in
the compressed, preferred physical embodiment of the database file of the present invention. As noted above, the
integer encoding techniques above provide some compression to reduce leading zero bits in integer numbers. The second
compression technique includes reducing offset values which follow a first offset in a sequence of concatenated offset
values to be relative offsets. The relative offsets provide a delta integer value from the immediate predecessor
offset value. The first offset value in such a sequence of offset values is the full integer value. The next offset
value is relative to the first, the third is relative to the second, the fourth to the third, etc. For example, the
sequence of values:

100, 200, 500, 1000

is encoded as the sequence:

100, 100, 300, 500

[0052] As an alternative, each subsequent integer offset value in a sequence may be relative to the first such value.
For example, the same sequence:

100, 200, 500, 1000

may be encoded as:

100, 100, 400, 900

[0053] Clearly the former approach provides superior compression and is therefore preferred. The latter method may
require less computation in that the sequence of compressed numbers need not be completely parsed. Only the leading
bits need be parsed to determine the number of bytes (as described above). The remaining bytes which encode the
integer value need not be accessed to determine the value of a later value in the sequence.
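
The two relative-offset schemes can be compared with a short C++ sketch. The function names are hypothetical, the offsets are assumed to be in ascending order, and EncodedSize merely restates the byte ranges given above for the variable-length encoding.

```cpp
#include <cstddef>
#include <vector>

// Bytes needed by the variable-length integer encoding described above.
// Values of 2^29 or more are not representable and are not handled here.
std::size_t EncodedSize(unsigned int v) {
    if (v < (1u << 7))  return 1;
    if (v < (1u << 14)) return 2;
    if (v < (1u << 21)) return 3;
    return 4;
}

// Preferred scheme: each offset after the first is stored relative to its
// immediate predecessor. Assumes the offsets are in ascending order.
std::vector<unsigned int> DeltaFromPrevious(const std::vector<unsigned int>& offsets) {
    std::vector<unsigned int> out;
    for (std::size_t i = 0; i < offsets.size(); ++i)
        out.push_back(i == 0 ? offsets[0] : offsets[i] - offsets[i - 1]);
    return out;
}

// Alternative scheme: each offset after the first is relative to the first.
std::vector<unsigned int> DeltaFromFirst(const std::vector<unsigned int>& offsets) {
    std::vector<unsigned int> out;
    for (std::size_t i = 0; i < offsets.size(); ++i)
        out.push_back(i == 0 ? offsets[0] : offsets[i] - offsets[0]);
    return out;
}

// Total encoded size of a sequence, in bytes.
std::size_t TotalSize(const std::vector<unsigned int>& values) {
    std::size_t total = 0;
    for (unsigned int v : values) total += EncodedSize(v);
    return total;
}
```

For the short sequence 100, 200, 500, 1000 the unmodified values occupy 7 bytes and either relative form occupies 6. For ascending offsets a delta from the predecessor is never larger than the delta from the first value, so the preferred scheme can only match or improve on the alternative, and the gap widens as sequences grow longer.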

WEB BROWSER OPERATION 

[0054] Figure 9 is a flowchart describing the operation of a standard web browser as modified to work in conjunction
with the methods, processes and structures of the present invention. Those skilled in the art will recognize that
figure 9 does not describe operation of web browsers in general as presently known in the art. Rather, figure 9
describes only the specific features of a web browser as adapted to utilize the present invention. In particular,
element 900 awaits and accepts user input to specify search parameters to be applied to the collection of text
documents. Not shown are the processing steps which serve to identify the collection of documents by path names or the
steps to provide the pre-built database file path. Such processing is well-known to those skilled in the art.

[0055] Element 902 is next operable to transmit the search parameters accepted by processing of element 900 to the
database client process for further processing. Element 904 then awaits return of the search results through
processing initiated and completed via the database client process. Lastly, element 906 displays the HTML formatted
search results (or other format having hyperlinks therein) as returned from the database client process. Processing
continues by looping back to element 900 to await further search parameters. Those skilled in the art will recognize
that standard web browser processing techniques may invoke further processing by clicking a hyperlink in the search
results displayed by operation of element 906 and as returned by operation of the database client process. Further, it
will be recognized that linear search techniques within standard web browsers may be invoked to further refine the
search of the information returned and displayed on the web browser's computer display screen.

DATABASE CLIENT PROCESS OPERATION 

[0056] Figure 10 is a flowchart describing methods operable within the database client process of the present
invention. As noted above, a web browser invokes the services of the present invention via the database client process
using the CGI communication gateway standards. The database client process, in turn, communicates with the database
server process to effectuate the query operations requested by the web browser. Those skilled in the art will
recognize that the features of the present invention may be implemented with or without such a client/server
architecture. As noted above, a web browser (via communications with a web server process) may invoke a database
manipulation program which directly accesses the database file rather than doing so through a database server process.
The client/server model of the present invention provides benefits in coordinating multiple shared simultaneous access
to the database file. In addition, the database client/server model preferred in the current invention permits the web
browser, web server, and database server processes to be distributed over independent computing nodes. In other words,
the client/server model preferred in the present invention is more easily integrated into a distributed computing
environment wherein processes communicate in a standardized manner regardless of the physical computing node on which
they are operating. Lastly, the database server process of the present invention, as discussed in additional detail
below, supports a query command stream and returns its results essentially in ASCII text. This allows the database
server process to be developed, tested, and debugged independent of the database client process.

[0057] Element 1000 is first operable to receive search parameters of a query request from the web browser. As noted
above, the web browser constructs a search request by accepting search parameters from an interactive user. Those
search parameters are transmitted directly to the database client process utilizing the CGI interfacing techniques.
Element 1002 is next operable to transform the search parameters received in the search request into appropriately
formatted search commands supported by the database server process of the present invention. Details of the search
commands so supported are provided herein below. Element 1004 is next operable to transmit the transformed
(re-formatted) search command to the database server process of the present invention. As noted above, the database
client process and database server process of the present invention preferably communicate using well-known
inter-process communication techniques. Such communication techniques simplify coordination of shared access to the
database file.

[0058] Element 1006 is next operable to await receipt of results of processing the transformed (re-formatted) search
command previously transmitted to the database server process. As above, the search results are returned from the
database server process to the database client process utilizing well-known network inter-process communication
techniques.

[0059] Element 1008 is then operable to transform the search command results returned from the database server process
into an appropriate page including hyperlinks indicative of the search results. The search results as returned from
the database server process are formatted in an internal tokenized form as presented herein below. Tokenized symbols
are transformed by element 1008 into hyperlinks for generating further query commands potentially of interest to the
user. Some queries return results which have a pre-defined format as discussed below wherein tokenizing is not
performed by the server process. Rather, elements (symbols or keywords) which may be of interest for further search
processing are clearly defined by the format of the query response.
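
The transformation performed by element 1008 on tokenized results may be sketched in C++ as follows. Several details are assumptions made for illustration: the byte values chosen for the TOKEN_START and TOKEN_END delimiters, the helper names EscapeHtml and TokensToLinks, and the CGI URL prefix passed by the caller. Symbols are assumed to be identifier-like, so no URL percent-encoding is attempted.

```cpp
#include <cstddef>
#include <string>

// Assumed delimiter values; the byte values here are chosen arbitrarily
// for this sketch.
const char TOKEN_START = '\x01';
const char TOKEN_END   = '\x02';

// Escape the characters that are special in HTML text and attributes.
std::string EscapeHtml(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '&': out += "&amp;";  break;
            case '<': out += "&lt;";   break;
            case '>': out += "&gt;";   break;
            case '"': out += "&quot;"; break;
            default:  out += c;
        }
    }
    return out;
}

// Replace each delimited token with a hyperlink that would issue a query
// for that symbol. `queryUrl` is a hypothetical CGI prefix.
std::string TokensToLinks(const std::string& text, const std::string& queryUrl) {
    std::string html;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t start = text.find(TOKEN_START, pos);
        if (start == std::string::npos) break;
        std::size_t end = text.find(TOKEN_END, start + 1);
        if (end == std::string::npos) break;   // unterminated: leave the rest as plain text
        html += EscapeHtml(text.substr(pos, start - pos));
        std::string sym = EscapeHtml(text.substr(start + 1, end - start - 1));
        html += "<a href=\"" + queryUrl + sym + "\">" + sym + "</a>";
        pos = end + 1;
    }
    html += EscapeHtml(text.substr(pos));
    return html;
}
```

Text outside the delimiters is escaped and passed through unchanged, so the page shows the original document with only the indexed symbols turned into links.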

[0060] Element 1010 is next operable to transmit the re-formatted search command results back to the web browser which
initiated the search request. As noted above, the web browser and database client process communicate utilizing
well-known CGI techniques. Processing then continues by looping back to element 1000 to await receipt of further
search requests and associated search parameters from an associated web browser.

DATABASE SERVER PROCESS OPERATION 

[0061] Figure 11 is a flowchart describing the processing performed by database server process 104 of the present
invention. Element 1100 is first operable to receive a search command from the database client process. As noted
above, search commands received from the database client process are formatted in an internal format supported and
defined by the database server process as discussed herein below. Element 1102 is next operable to spawn a thread for
processing of the received search command. Well-known multi-threaded programming techniques are applied to permit
multiple search commands to be processed on behalf of multiple database client processes. The multi-threaded
programming technique also permits the server process to more easily "clean up" on behalf of a failed processing
thread. For example, failure of a single thread processing a particular search request does not impact concurrent
processing by other threads of other search requests on behalf of other database client processes. The multi-threaded
aspect of the database server processing is depicted in figure 11 by the multiple arrows exiting from processing of
element 1102. The newly spawned thread continues processing with elements 1104-1110 through to completion. The main
line database server process continues processing by looping back to element 1100 to await receipt of another search
request from another database client process.

[0062] The newly spawned thread of the database server process continues with element 1104 to process the search
request received from the database client process. Element 1106 is next operable to determine if the query was for the
contents of a file (a file contents query as generated by the browser program). If not, processing continues with
element 1110 to transmit the search results to the database client process for further processing on behalf of the web
browser program. If the query is a file contents query, processing continues with element 1108 to tokenize the file
contents query results.

[0063] As noted above, each symbol in a file contents query is tokenized by the database server process. In particular,
each symbol in the file content text stream is delimited by the TOKEN_START and TOKEN_END characters as discussed
below. The tokenized results are then transmitted to the database client by operation of element 1110 for further
processing on behalf of the web browser program.
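
A minimal C++ sketch of the tokenizing step of element 1108 follows. It assumes arbitrary byte values for TOKEN_START and TOKEN_END, treats identifier-like words as candidate symbols, and takes the set of indexed symbols as a parameter; the function name Tokenize is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Assumed delimiter values, chosen arbitrarily for this sketch.
const char TOKEN_START = '\x01';
const char TOKEN_END   = '\x02';

// Wrap every word of `text` that appears in `indexed` in the token
// delimiters; all other characters are copied through unchanged.
std::string Tokenize(const std::string& text, const std::set<std::string>& indexed) {
    std::string out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        unsigned char c = static_cast<unsigned char>(text[pos]);
        if (std::isalpha(c) || c == '_') {
            // Scan to the end of the identifier-like word.
            std::size_t end = pos;
            while (end < text.size() &&
                   (std::isalnum(static_cast<unsigned char>(text[end])) || text[end] == '_'))
                ++end;
            std::string word = text.substr(pos, end - pos);
            if (indexed.count(word)) {
                out += TOKEN_START;
                out += word;
                out += TOKEN_END;
            } else {
                out += word;
            }
            pos = end;
        } else {
            out += text[pos++];
        }
    }
    return out;
}
```

Because only the delimiters are added, a client that does not need hyperlinks can recover the original text by discarding the two delimiter characters.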

[0064] Those skilled in the art will readily recognize that elements 1104-1110 (a single thread of the database server
process) may operate concurrently to provide streaming of the resultant data back to the database client process
requesting the search. In other words, as element 1104 continues to process the search command, thereby generating
search results, elements 1106-1110 may concurrently operate to transmit those results already generated back to the
database client process. In this manner, the web browser, the database client process, and the database server process
may all overlap their processing to provide the desired rapid response to the interactive user of the web browser.
Early results of the query process are viewable at the user's computer display even as later results are yet to be
generated. The search results are said to be streamed from the database server process through to the web browser for
display on the user's computer screen.

[0065] Element 1104 is described above as performing the specified query through use of the database file. Such an
operation is well understood by those skilled in the art in view of the logical description of the database file
presented above with respect to figure 2. With respect to the preferred physical embodiment of the database file as
discussed above in conjunction with figure 3, the following pseudo-code listing is helpful in understanding the
detailed operation of element 1104.



// MAKE_PTR turns an internal file offset into a real C-language pointer
// for direct dereference by the C-runtime environment. If the offset was
// zero, MAKE_PTR returns NULL, otherwise it adds the offset to the base
// of the memory mapped database in memory.

// ReadCompInt() reads a compressed integer from the database and converts it
// to a normal integer. ReadCompInt() also advances FilePtr to the next byte
// after the compressed integer.

Lookup(Key, MatchCase, OutsideCurlies)

    HashIndex = HashKey(Key)
    FilePtr = MAKE_PTR(HashTable[HashIndex])
    if FilePtr is NULL, this hash chain is empty, so no matches, so exit

    // This outermost loop is executed once for each match, the inner
    // loop loops across non-matches within the chain between matches.
    // Could still be zero matches if token doesn't exist.
    // Could be 1 match if token exists and we're matching case.
    // Could be many matches if not matching case.

    loop, to find the next match
        loop, to skip over non-matches
            KeyOffset = ReadCompInt(&FilePtr)
            if KeyOffset is zero, then no match, exit this loop
            KeyOffset = MAKE_PTR(KeyOffset)
            DataOffset = ReadCompInt(&FilePtr)
            if MatchCase and case sensitive match between *KeyOffset and Key, or
               not MatchCase and case insensitive match between *KeyOffset and Key
            then match found, exit this loop
        end loop to skip over non-matches

        if no match found, then exit

        NextKeyOffset = FilePtr
        FilePtr = MAKE_PTR(DataOffset)
        loop, to traverse the list of files
            FileNum = ReadCompInt(&FilePtr)
            if FileNum is zero, done, exit this loop
            LinesOffset = ReadCompInt(&FilePtr)
            NextFileOffset = FilePtr
            FilePtr = MAKE_PTR(LinesOffset)
            loop, to traverse the list of lines with the token
                LineNum = ReadCompInt(&FilePtr)
                if LineNum is not zero, and OutsideCurlies is set, and the
                   LineNum is tagged as OutsideCurlies
                then MATCH FOUND: process match
            end loop when LineNum is zero
            FilePtr = NextFileOffset
        end loop to traverse list of files
        FilePtr = NextKeyOffset
    end loop to find next match



WEB BROWSER/DATABASE CLIENT PROTOCOL 

[0066] The present invention provides for various search requests (also referred to herein as queries) between the web
browser and the database client process. As noted above, the web browser and database client process preferably
communicate using the CGI standards. A query is communicated from the web browser to the database client in response to
the user entering input search symbols or keywords and clicking a button to initiate the search processing. The type
of search and various parameters relating to the selected search type are then transmitted to the database client
process. The database client process and database server process then communicate as discussed herein to process the
query and to return results thereof to the web browser in the form of an HTML page (or other format having
hyperlinks). The present invention includes the following four query types.

File contents           The file contents query is generated by the web browser to return the contents of one file
                        from the collection of text documents in the database. The file contents are retrieved and
                        displayed by the web browser. The results retrieved and returned by the database client and
                        server processing include hyperlinks for each symbol that is indexed in the database file for
                        the returned text document. The hypertext links will invoke a query corresponding to the
                        symbol for a "symbol in files" query as described herein below.

Substring in paths      A substring in paths query is generated by the web browser to request a list of filenames for
                        text documents in the collection of text documents which match a specified string. Each
                        filename returned by the query results is displayed by the web browser as a hyperlink which
                        specifies a file contents query for the corresponding file (as described above).

Substring in symbols    A substring in symbols query command is generated by the web browser to request the list of
                        symbol names that contain a specified substring. As returned from the database client process,
                        each symbol name matching the substring is a hypertext link to a symbol in files query as
                        described herein below.

Symbol in files         A symbol in files query is generated by the web browser to request a list of lines that
                        contain a specified symbol. Each line returned by the database client process includes the
                        filename, line number, and line of text including the requested symbol. The filename as
                        returned from the database client is a hyperlink specifying a file contents query for the
                        corresponding file as described above.

[0067] Those skilled in the art will readily recognize that other query commands may be included within the scope of
the present invention. Furthermore, those skilled in the art will recognize that subsets of the above described
queries as well as other substituted queries relating to symbols within collections of text documents are within the
scope of the present invention. The above identified four query commands are intended as examples of a useful set of
queries to permit rapid user searching for symbols or keywords in collections of text documents.

[0068] As noted above with respect to element 1008, certain search results are preferably returned from the server
process in a tokenized format. In particular, in the preferred embodiment of the present invention, the results of a
file contents query are returned in tokenized form such that all symbols in the database file are tokenized in the
file contents returned from the database server to the database client. Other exemplary queries listed above generate
search results from the server in a predefined format. Element 1008 above accepts all such formats for returned search
results (tokenized and non-tokenized pre-defined formats) and converts them to pages having hyperlinks for items
therein having likely interest for the user's next search request. The hyperlinks define a query for more information
regarding the corresponding symbol or keyword.

[0069] As noted herein below, several of the above identified queries (in particular the symbol in files query
command) permit options to be specified to control the searching performed by the database server process to satisfy
the query command. Certain such parameters are meaningful for particular types of text documents as processed by
optimized parsers (as discussed above with respect to database build processing techniques). For example, in searching
for symbols in C language source programs, it is often useful to search for a symbol with or without case sensitive
matching. Further, it may be useful to search for a symbol outside of curly braces (i.e., to identify global symbol
declarations as opposed to symbol references). Other such search parameters may include stripping leading underscore
characters from symbols when matching for a requested substring. Another parameter may specify that the returned
results should not be tokenized (as described herein below) and hence returned faster. For example, if the user
requests the contents of a large text document or queries for a symbol likely to be found frequently in the collection
of text documents, the user may realize in advance that the links are not required for subsequent searches. Specifying
the "not tokenized" parameter allows the query results to be returned more quickly. The "not tokenized" parameter also
permits the returned results to be usable for other than web browsing. For example, the returned information may be
saved in a file.

[0070] Those skilled in the art will recognize a wide variety of such options that may be supported by the database
server process and hence supported in the interface between the web browser and the database client process. The above
list is intended merely as exemplary of the types of search parameters which may be specified in addition to the
symbol or keyword search terms specified by the web browser user.

[0071] Present web browsers typically allow a displayed page (e.g., an HTML page) to specify that it is or is not cacheable.
The database client process of the present invention therefore sets appropriate attributes on records returned to
the web browser to ensure that the records are cached locally by the web browser. Subsequent requests for other lines
in a file may be satisfied locally by the web browser recognizing the information as resident in its local cache. In particular,
for example, if the user issues a query identical to an earlier query, the web browser can recognize the earlier
search results in its cache and speed the presentation of the results to the user. Or, for example, if a user issues a query
to display an entire text file, the file is presented to the user at a starting line number indicated by the user. If a subsequent
query requests the same file, but perhaps a different starting line number, the web browser will recognize that the
entire file is already cached and speed the display of the requested portion of the file to the user.
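As a rough illustration of marking a returned page as cacheable, the following Python sketch assembles a CGI-style response with a caching header. The text does not say which attributes the database client process sets; the use of a `Cache-Control` header, the one-hour lifetime, and the function name `build_cgi_response` are all assumptions made for the example.

```python
def build_cgi_response(html_body: str, max_age_seconds: int = 3600) -> str:
    """Return CGI output: header lines, a blank line, then the page body."""
    headers = [
        "Content-Type: text/html; charset=utf-8",
        # Assumed mechanism: allow the browser to reuse this page locally.
        f"Cache-Control: private, max-age={max_age_seconds}",
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + html_body
```

A repeated, identical query URL could then be answered from the browser cache until the assumed lifetime expires.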

DATABASE CLIENT/DATABASE SERVER PROTOCOL 

[0072] As noted herein above, the database client process re-formats the search command and parameters supplied
to it by the web browser into an internal request and response format defined and supported by the database server
process. The database server process defines a stateless version of the commands supported as well as a state based
version of the supported search commands. In the state based version, the server process retains some state information
regarding the processing requested by the database client process. For example, a first command describes the
database file path name to be used for processing of queries. A second command specifies, for example, a particular
query to be performed. When processing the second command, the database server process uses saved state information
from prior commands to identify, for example, the path name of the database file to be used in satisfying the
query. In a state based model such as this, a connection with a particular client process requires saved state information
regarding that connection. In other words, the server process must maintain state information for each presently
active client connection. Further, an active client connection must be "closed" to recover the resources in the server
dedicated to that open connection.

[0073] In a stateless model, the best presently known mode for practicing the present invention, the database server
process retains no such state information. Rather, each command (query request) received provides all information
necessary to process the command (e.g., the database path name plus all values and parameters needed to process
the query request). The connection with a client process exists only for the duration of processing that request. No state
information is retained between such requests. The state based mode of practicing the present invention is, however,
useful, as noted above, for development, testing, and debug of the database server process independent of the database
client process (i.e., using a simple ASCII text command interface wherein saved state information need not be re-entered
for testing of each command).
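The difference between the two models can be sketched in Python as follows. This is only an illustration under stated assumptions: `StatefulSession`, `handle_stateless`, and the placeholder `run_query` are hypothetical names, and the text-line command format is assumed rather than taken from the source.

```python
def run_query(db_path: str, command: str, argument: str) -> str:
    """Placeholder for the real database lookup (hypothetical)."""
    return f"{command} {argument!r} on {db_path}"


class StatefulSession:
    """State based model: the server remembers the database path per connection."""

    def __init__(self) -> None:
        self.db_path = None  # saved state, released when the connection closes

    def handle(self, line: str) -> str:
        command, _, argument = line.partition(" ")
        if command == "DBPATH":
            self.db_path = argument  # the first command establishes the state
            return "1"
        if self.db_path is None:
            return "0: no database selected"
        return run_query(self.db_path, command, argument)


def handle_stateless(db_path: str, command: str, argument: str) -> str:
    """Stateless model: every request carries everything needed to answer it."""
    return run_query(db_path, command, argument)
```

In the stateless form nothing survives between calls, so the server has no per-connection resources to recover.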

[0074] An exemplary preferred embodiment of the protocol used in communicating between a database client process
and a database server process is described below. First, commands designed around a state based model of the interface
are presented, followed by the equivalent commands for the stateless model. In all cases below, a request format
is shown with the label "REQ" and the associated response is labeled as "REPLY".

[0075] Responses generated by many of the commands listed below are "tokenized" in that all symbols or keywords
in the search results are returned as tokens (delimited by TOKEN_START and TOKEN_END delimiter bytes).
Specifically, a file contents query issued by the user generates a QFILE server query command (as described below).
The server process returns the entire file contents as an ASCII text stream wherein each symbol in the ASCII text
stream is delimited as a token. The database client process, in turn, translates each token so delimited in the search
results into a hyperlink for performing a further query on that symbol. Still more specifically, the tokens in the search
results are preferably transformed by the database client process into hyperlinks for locating associated information
rapidly.
EOF_MARKER  = 0
TOKEN_START = 1
TOKEN_END   = 1

<> == replaced by described information/parameters
[ ] == optional information in protocol
{ } == annotation, not part of client/server protocol
|  == alternate selections

REQ : DBPATH <database path>
RESP: 1 | 0[: <error message>]

This command and reply essentially establishes a connection between a client
process and a server process and specifies the path name for the database file
to be used for queries processed in this open connection. The reply simply
indicates success or failure. In the case of failure an error message may be
appended.
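A client might interpret the "1 | 0[: <error message>]" status reply along the following lines. This Python sketch assumes the reply arrives as a single line of text; the function name `parse_status` is hypothetical.

```python
from typing import Tuple


def parse_status(reply: str) -> Tuple[bool, str]:
    """Split a status reply into (success, error_message)."""
    status, _, message = reply.strip().partition(":")
    status = status.strip()
    if status not in ("0", "1"):
        raise ValueError(f"unexpected status reply: {reply!r}")
    # "1" means success; "0" means failure, optionally followed by a message.
    return status == "1", message.strip()
```

For example, `parse_status("0: no such database")` yields a failure flag together with the appended message.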



REQ : QFILES <keyword>
RESP: F File1
      F File2
      F File3
      <repeat F ...>



REQ : QPATHS <keyword> {same as QFILES}
RESP: F File1
      F File2
      F File3
      <repeat F ...>

These commands (essentially synonyms) return a list of file names in the
presently open database file whose path names include the specified keyword
substring.



REQ : QLINES <keyword> <case_sensitive: 0 or 1> <outside_of_{ }: 0 or 1>
RESP: C <common_path>
      R <relative_path> {may be null}
      L <line#> <line contents>
      {repeat R and L records for all lines in all files}

This command returns a list of files and lines in each file where the specified
keyword is located in the collection of text documents associated with the
presently open database file. The case_sensitive parameter specifies whether
the case of the keyword is to be considered in performing the search. The
outside_of_{ } parameter specifies that the search is to locate only matching
keywords that are outside the scope of all C programming language blocks
(delimited by pairs of curly braces).
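The C, R, and L records of a QLINES response could be collected into a per-file structure roughly as shown below. This Python sketch assumes one record per text line with the record letter first; the framing is not specified in the text, and the function name `parse_qlines` is hypothetical.

```python
from typing import Dict, List, Tuple


def parse_qlines(reply: str) -> Tuple[str, Dict[str, List[Tuple[int, str]]]]:
    """Return (common_path, {relative_path: [(line_number, line_text), ...]})."""
    common_path = ""
    hits: Dict[str, List[Tuple[int, str]]] = {}
    current = None  # list of hits for the most recent R record
    for record in reply.splitlines():
        if not record:
            continue
        kind, rest = record[0], record[1:].lstrip()
        if kind == "C":
            common_path = rest
        elif kind == "R":
            current = hits.setdefault(rest, [])
        elif kind == "L" and current is not None:
            number, _, text = rest.partition(" ")
            current.append((int(number), text))
    return common_path, hits
```

Each R record opens a new file entry, and the L records that follow are attached to it until the next R record.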



REQ : QSYMS <keyword> <case_sensitive: 0 or 1>
RESP: S <symname>
      {repeat S records for all matching symbols}



This command returns a list of all symbols found in the database which include 
(as a substring) the supplied keyword. As above, the case_sensitive parameter 
may be specified to indicate the relevance of the case of the keyword parameter 
in the search. 



REQ : QFILE <full_pathname: common_path + relative_path>
RESP: 1 | 0[: <error message>]
      B <number of bytes>
      <bytes of data>

This command returns the entire contents of a file specified by its full file name
as a parameter. First the number of bytes to be returned is returned (i.e., the
length of the tokenized byte stream to follow) followed by the tokenized byte
stream as described below.



<bytes of data> ==
      <data><TOKEN_START><token_data><TOKEN_END><data> {repeat}

The tokenized byte stream format described above is the entire content of a
requested file where each symbol in the text stream which was parsed by the
database builder process and hence entered into the database file is identified
as a token. As noted elsewhere, the database client process transforms this
tokenized stream into a corresponding page with hyperlinks for display. Each
token is transformed into parameters appropriate for a QLINES command.
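The transformation of a tokenized byte stream into a page with hyperlinks might look like the following Python sketch. Several details are assumptions: the two delimiter byte values are passed in as arguments because the listing as reproduced here cannot be confirmed (the usage below picks 1 and 2 purely for illustration), and the URL `/cgi-bin/search` with its `op` and `key` parameters is hypothetical.

```python
import html
from urllib.parse import quote


def tokens_to_html(stream: bytes, token_start: int, token_end: int,
                   query_url: str = "/cgi-bin/search") -> str:
    """Replace each delimited token with a link that queries for that symbol."""
    out = []
    pos = 0
    while pos < len(stream):
        start = stream.find(bytes([token_start]), pos)
        if start < 0:
            # No further tokens: emit the remaining plain text and stop.
            out.append(html.escape(stream[pos:].decode("ascii", "replace")))
            break
        end = stream.find(bytes([token_end]), start + 1)
        if end < 0:
            raise ValueError("unterminated token in stream")
        out.append(html.escape(stream[pos:start].decode("ascii", "replace")))
        symbol = stream[start + 1:end].decode("ascii", "replace")
        out.append('<a href="{0}?op=lines&amp;key={1}">{2}</a>'.format(
            query_url, quote(symbol), html.escape(symbol)))
        pos = end + 1
    return "".join(out)
```

Calling `tokens_to_html(b"int \x01main\x02(void)", 1, 2)` wraps `main` in a link and leaves the surrounding text as escaped plain text.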
REQ : Q_VERSION
RESP: V <version string>

This command merely returns a version number for the database server
process. This allows the database client to adapt to upgrades in the features of
the server process.



REQ : QUIT 

RESP:{none - socket disconnected} 

This command terminates an open connection to a client (in the state model). 



The following commands represent extensions to the client/server protocol of the 
present invention which provide for stateless operation as is preferred. 



REQ : Q_PATHS <database path length> <database path> <keyword length>
      <keyword>
RESP: 1 | 0[: <error message>]
      {see QPATHS for remainder}

This command is identical in operation to the combination of a DBPATH
command and a QPATHS command as described above. This command
combines the parameters and return information to provide a stateless version
of the command. The connection with the client is closed following completion
of the command. The return data is as described above.



REQ : Q_LINES <database path length> <database path> <keyword length>
      <keyword>
      <case_sensitive: 0 or 1>
      <outside_of_{ }: 0 or 1>
      <match_leading_underscore: 0 or 1>
RESP: 1 | 0[: <error message>]
      B <max line number> <number of lines to follow>
      {see QLINES for remainder}

This command is essentially identical in operation to the combination of a
DBPATH command and a QLINES command as described above. This
command combines the parameters and return information to provide a stateless
version of the command. The connection with the client is closed following
completion of the command. The "B" return value includes the number of lines
to be returned so that the user may as early as possible determine whether the
results are worth viewing. The return data is as described above.
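A client could assemble such a self-contained request roughly as in the Python sketch below. The wire encoding is not specified in the text, so the sketch assumes space-separated ASCII fields, decimal byte lengths, and a trailing newline; the command name is written here as `Q_LINES`, and the function name `build_q_lines` is hypothetical.

```python
def build_q_lines(db_path: str, keyword: str,
                  case_sensitive: bool = True,
                  outside_braces: bool = False,
                  match_leading_underscore: bool = False) -> bytes:
    """Build one stateless request carrying the path, keyword, and flags."""
    path_bytes = db_path.encode("utf-8")
    key_bytes = keyword.encode("utf-8")

    def flag(value: bool) -> bytes:
        return b"1" if value else b"0"

    fields = [
        b"Q_LINES",
        str(len(path_bytes)).encode("ascii"), path_bytes,  # length, then path
        str(len(key_bytes)).encode("ascii"), key_bytes,    # length, then keyword
        flag(case_sensitive),
        flag(outside_braces),
        flag(match_leading_underscore),
    ]
    return b" ".join(fields) + b"\n"
```

Because each length precedes its field, a server can read the path and keyword without relying on any state saved from an earlier command.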



REQ : Q_SYMS <database path length> <database path> <keyword length>
      <keyword>
      <case_sensitive: 0 or 1>
RESP: 1 | 0[: <error message>]
      {see QSYMS for remainder}

This command is essentially identical in operation to the combination of a
DBPATH command and a QSYMS command as described above. This
command combines the parameters and return information to provide a stateless
version of the command. The connection with the client is closed following
completion of the command. The return data is as described above.



REQ : Q_FILE <database path length> <database path> <size of full path>
      <full_path: common_path + relative_path> <tokenized: 0 or 1>
RESP: 1 | 0[: <error message>]
      {see QFILE for remainder}

This command is essentially identical in operation to the combination of a
DBPATH command and a QFILE command as described above. This
command combines the parameters and return information to provide a stateless
version of the command. The connection with the client is closed following
completion of the command. The return data is as described above.



EXEMPLARY SCREEN DISPLAYS 

[0076] Figures 4 through 7 are exemplary screen displays on a web browser which typify the operation of the present
invention. In particular, figure 4 is a screen display exemplifying the query and response for a Q_LINES query (as
described above). Checkboxes 400-406 select the type of query operation desired as marked on the textual label associated
with each checkbox. These operations correspond to the four operations supported in the web browser to database
client process interface. As shown in figure 4, a Symbol in Files query is requested by virtue of checkbox 400 being
marked. This query request is transformed by the database client process into a Q_LINES server request.

[0077] Checkboxes 408-414 are used to select search options appropriate for the type of search requested. The
parameters correspond to the textual label associated with each checkbox on figure 4 and as described herein above.
As shown in figure 4, the user has requested that the Symbol in Files query request match the case of the supplied keyword
with the tokens in the database file search.

[0078] Query box 416 permits the user to enter a keyword which is to be searched by the query request. In particular,
the query specified by the user as exemplified in figure 4 is to search for the symbol "abort" in all files and to match the
lower case specified by the user.

[0079] Buttons 418 and 420 are used to control operation of the browser. In particular, the user clicks button 418 to
evaluate (perform) the search specified in the query and checkboxes. The user clicks button 420 to clear the search
parameters and query keywords.

[0080] Any common portion of the path names of all files in the collection of text documents is shown at label 450.
List 452 displays the results of the Symbol in Files query request. All lines in the collection of text documents which contain
the keyword "abort" (in lower case) are displayed including the file name (the relative portion of the path name
devoid of the common portion of the path name shown at 450), the line number and the line of text from the corresponding
file.

[0081] Each file path name in the result list 452 is a hypertext link to generate a File Contents query for the corresponding
file. By clicking on the link, the user navigates to a listing of the file contents of the corresponding file.

[0082] Figure 5 shows a similar query to that shown in figure 4 but with the Outside of { } checkbox 410 marked. As
can be seen in figure 5, the results list 454 shows only the subset of lines listed in figure 4 at 452 in which "abort"
appears outside curly braces (i.e., in global declaration contexts).
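The "outside of curly braces" filter can be approximated by tracking brace depth while scanning the text, as in this Python sketch. It is a simplification made for illustration: it matches plain substrings rather than parsed symbols, ignores braces inside comments and string literals, and the function name `lines_outside_braces` is hypothetical.

```python
from typing import List, Tuple


def lines_outside_braces(source: str, keyword: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs where keyword occurs at brace depth 0."""
    hits = []
    depth = 0  # number of currently open "{" blocks
    for number, line in enumerate(source.splitlines(), start=1):
        found = False
        for column, char in enumerate(line):
            if char == "{":
                depth += 1
            elif char == "}":
                depth = max(depth - 1, 0)
            elif depth == 0 and line.startswith(keyword, column):
                found = True
        if found:
            hits.append((number, line))
    return hits
```

Given a declaration `void abort(void);` followed by a function whose body calls `abort();`, only the declaration line is reported.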

[0083] Figure 6 is an exemplary screen display showing a Substring in Symbols query (as indicated by the mark in
checkbox 402). The query in box 416 requests that all symbols having "frame_of" as a substring be located and
displayed. The result list 456 shows all such symbols which contain the substring "frame_of." Each symbol in the results
list is a hypertext link to generate a Symbol in Files query request for the corresponding symbol.

[0084] Figure 7 is another exemplary screen display typifying a Substring in Paths query request and results (as indicated
by the marked checkbox 404). The query specifically requests a list of all file names (paths) which contain the
substring "lib" as entered in box 416. Results list 458 shows the relative portion of all paths known in the database which
contain the specified string as a substring. Each listing in the results list is a hypertext link to generate a File Contents
query request for the corresponding file.

[0085] While the invention has been illustrated and described in detail in the drawings and foregoing description, such
illustration and description is to be considered as exemplary and not restrictive in character, it being understood that
only the preferred embodiment and minor variants thereof have been shown and described and that all changes and
modifications that come within the spirit of the invention are desired to be protected.

Claims 

1. A system for searching for symbols in a text document using a web browser (100) CHARACTERIZED IN THAT the
system comprises:

a database file (108) identifying locations of said symbols in said text document (106); and
a database search process (102, 104), associated with said web browser (100), for performing queries (150)
on said database file (108) on behalf of said web browser (100), wherein said database search process (102,
104) returns results (160) of said queries (150) to said web browser (100) in a format having hyperlinks corresponding
to symbols in said results of said queries.

2. The system of claim 1 wherein said database search process includes:

a database client process (102) associated with said web browser (100); and
a database server process (104) associated with said database client process (102) for processing said queries
(150) in said database file (108),
wherein said database client process (102) is adapted to receive said queries (150) from said web browser
(100) and to forward said queries (152) to said database server process (104) for processing, and
wherein said database client process (102) is adapted to receive results of processing said queries (158) and
is adapted to return said results (160) to said web browser (100) in a format having hyperlinks corresponding
to symbols in said results.

3. The system of claim 1 further comprising:

a database builder process (110) for constructing said database file (108).

4. The system of claim 3 wherein said database builder process includes: 

a parser (112) for identifying said symbols in said text document (106). 


5. The system of claim 3 wherein said document has a document type associated therewith and wherein said database
builder process includes:

a plurality of parsers (112, 114) wherein each of said plurality of parsers is adapted to parse a particular type
of document and wherein said builder process (110) selects a particular one of said plurality of parsers (112,
114) adapted to parse documents of said document type.

6. The system of claim 2 wherein said database server process is multi-threaded.

7. A method for searching for symbols in a text document using a web browser CHARACTERIZED IN THAT the
method comprises the steps of:

providing a database file (828, 108) identifying locations of said symbols in said text document (106); and
constructing, within said web browser, a database query (900-902) to locate said symbol in said text document
(106);
performing said query (1000-1110) in said database file (108) on behalf of said web browser;
receiving, within said web browser, results of said query (904) in a format having hyperlinks corresponding to
symbols in said results of said query; and
displaying said results (906) in said web browser.

8. The method of claim 7

wherein the step of constructing a database query includes the step of transmitting said query to a database
client process (902),
wherein the step of receiving results includes the step of receiving results from said database client process
(904, 1010), and
wherein the step of performing said query includes the steps of:

transmitting said query from said database client process to a database server process (1004);
executing said query within said database server process (1104); and
returning results of said query from said database server process to said database client process (1110).

9. The method of claim 8 wherein the step of displaying includes the step of transforming, within said database client
process, said results received from said database server process into HTML format (1008).

10. The method of claim 8 wherein the step of performing said query further includes the step of creating an independent
thread (1102), within said database server process, for performing the steps of executing said query and returning
results of said query to said database client process (1104-1110).



[FIG. 2: diagram; legible labels: HASH TABLE, FILE INDEX 1 through 3, LINE #1, LINE #2, TOKEN 1 through 4; reference numerals 204, 206, 208]

[FIG. 4, FIG. 5, FIG. 7, FIG. 8: drawings not reproduced]